Main packages used: ggplot2, ggraph
Main functions covered: ggplot(), geom_*(), scale_*_*(), labs(), theme_*()

2 Data visualization principles

Minimize noise and maximize signal in your graphs (put another way: maximize the data-ink ratio).

source: Darkhorse Analytics

  • Do not lie with your visualization
  • Avoid chart junk
  • Choose the type of plot depending on the type of data
  • Label chart elements properly and informatively
  • Be mindful of where your x and y axes start (scales can be really deceiving otherwise)
  • Use consistent units (do not mix yearly and monthly GDP for example)
  • ABSOLUTELY NO 3D (PIE) CHARTS.

3 ggplot2 and its extensions

The name stands for grammar of graphics. It lets you build your plot layer by layer while controlling every detail of the output (if you so wish). It is used by many in academia and by writers at the Financial Times and FiveThirtyEight, among many others. During this workshop we will go through various types of data visualisations and try to apply the principles set out above to our output.

You create plots with the below syntax:
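As a minimal sketch of the general shape of a call: data first, then the aesthetic mapping, then at least one geom added with +. The built-in mtcars data is used here only as a stand-in, since the workshop data has not been loaded yet.

```r
library(ggplot2)

# Template: ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()
# mtcars is a stand-in dataset, not the workshop data.
p_basic <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()

p_basic
```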

Let’s load our data that we’ll be using this session. (don’t worry about the message about the duplicated column names for now)

3.1 The logic of a ggplot2 plot

We now have some experience with making nice figures with ggplot2. To kickstart this session, let’s review how a plot is made and extend our knowledge slightly. We will use the iris dataset for this purpose. This section was inspired by the great RLadies presentation of Eva Maerey.

First, we specify the data we want to use within our ggplot() function call with the data = argument.

Second, we decide on the dimensions of our data. Let’s start by specifying what to plot on the y and x axes. This is done with aes(), which stands for ‘aesthetic’ and is passed to the mapping = argument.

Third, we add our wanted representation of the data, with the geom_ function family.

Fourth, we can add a further dimension to our plot by extending the aes() arguments. Let’s add colors based on the Species variable.

Fifth, each aesthetic can be rescaled. We saw this with the GDP data before. Now we want to rescale our colors. We will use the manual color scale to specify each value. Colors can be added as HEX code, or names.

Sixth, we can modify the textual elements of our plot. To do this, we can assign a string to every text element with the labs() function. As we can see, the color aesthetic automatically created a legend on the side. We can remove its title if we want to.

Finally, we decide on the theme of our hearts. ggplot2 offers an ocean of customization options for our plot: there are some premade themes, but we can create our own as well. For now we will stick to theme_minimal().
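The seven steps above can be put together in one call. This is a sketch rather than the original workshop code: the particular colors, title and axis labels are illustrative choices.

```r
library(ggplot2)

p_iris <- ggplot(
  data = iris,                                    # 1. the data
  mapping = aes(x = Sepal.Length,                 # 2. what goes on the axes
                y = Sepal.Width,
                color = Species)                  # 4. a further dimension
) +
  geom_point() +                                  # 3. how the data is drawn
  scale_color_manual(                             # 5. rescale the color aesthetic
    values = c(setosa = "#1b9e77",                #    HEX codes or color names both work
               versicolor = "#d95f02",
               virginica = "purple")
  ) +
  labs(                                           # 6. text elements
    title = "Sepal dimensions of iris flowers",
    x = "Sepal length (cm)",
    y = "Sepal width (cm)",
    color = NULL                                  #    NULL drops the legend title
  ) +
  theme_minimal()                                 # 7. the theme

p_iris
```

Because each step is just another term added with +, any of them can be commented out to see what that layer contributes.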

3.2 Scatter plot

We use a scatter plot to illustrate an association between two continuous variables. Usually, the y axis shows our dependent variable (the variable which is explained) and the x axis shows the independent variable, which we suspect drives the association.

Now, we want to know what the association is between GDP per capita and life expectancy.

Now that we have a basic figure, let’s make it better. We transform the x axis values with the scale_x_log10() and add text to our plot with the labs() function. Within geom_point() we can also specify geom specific options, such as the alpha level (transparency).

To add some analytical power to our plot we can use geom_smooth() and choose a method for its smoothing function. It can be lm, glm, gam, loess, or rlm. We will use the linear model (“lm”). Note: this is purely for illustrative purposes, as our data points are country-years, so “lm” is not a proper way to fit a regression line to this data. This example also shows how to plot two geoms into one figure.
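A sketch of the log-scaled scatter plot with a fitted line. It assumes the gapminder package as a stand-in for the workshop data file, so the column names gdpPercap and lifeExp are that package's names and may differ from the ones used in the session.

```r
library(ggplot2)
library(gapminder)  # assumption: stand-in for the workshop's own data file

p_gdp <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.3) +        # geom-specific option: transparency
  geom_smooth(method = "lm") +     # second geom: linear fit (illustrative only)
  scale_x_log10() +                # log-transform the x axis
  labs(
    title = "GDP per capita and life expectancy",
    x = "GDP per capita (log scale)",
    y = "Life expectancy (years)"
  )

p_gdp
```

Note that the line is fitted after the scale transformation, so it is linear in log10(GDP per capita), not in GDP per capita itself.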

What if we want to see how each continent fares in this relationship? We need to include a new argument in the aes() mapping: color =. Now it is clear that European countries (country-years) are clustered in the high-GDP, high-life-expectancy upper right corner.

We can add a horizontal or vertical line to our plot if we have a particular cutoff that we want to show. We can add these with the geom_hline() and geom_vline() functions.
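A sketch of both ideas, again assuming the gapminder package as a stand-in dataset. The cutoffs of 70 years and 10,000 (in the dataset's GDP per capita units) are arbitrary values chosen for illustration.

```r
library(ggplot2)
library(gapminder)  # assumption: stand-in for the workshop's own data file

p_continent <- ggplot(gapminder,
                      aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 70, linetype = "dashed") +     # arbitrary cutoff on y
  geom_vline(xintercept = 10000, linetype = "dashed") +  # arbitrary cutoff on x
  scale_x_log10() +
  labs(x = "GDP per capita (log scale)",
       y = "Life expectancy (years)",
       color = "Continent")

p_continent
```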

3.3 Histogram

We use histograms to check the distribution of our data, as we have seen in the intro sessions.

To add some flair to our figure, we use color and fill inside the geom_ call. What is the difference between the two?

We can overlay more than one histogram on top of each other. See how the different iris species have different sepal length distributions.

Notice how we used fill as a mapping aesthetic inside aes() here, rather than as a fixed value inside the geom as in the previous example. Placed in the ggplot() call, the fill = mapping applies to the whole plot, not just to a single geom.
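A sketch of the two histogram variants using the built-in iris data. The bin width and the colors are illustrative choices.

```r
library(ggplot2)

# Fixed values set inside the geom: color is the bar outline, fill is the interior.
p_hist <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.25, color = "white", fill = "steelblue")

# fill mapped to a variable inside aes(): one histogram per species.
# position = "identity" overlays the histograms instead of stacking them,
# and alpha keeps the overlapping bars visible.
p_hist_species <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(binwidth = 0.25, alpha = 0.5, position = "identity")

p_hist
p_hist_species
```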

3.4 Density plots

A variation on the histogram is the density plot, which uses kernel smoothing (fancy! but in reality it is a smoothing function that uses weighted averages of neighboring data points).

Add some fill

Your intuition is correct: we can overlay this on our histogram. To keep the y axis consistent between the histogram and the density plot, we use the ..density.. term for geom_histogram() so that the y axis shows density rather than frequency. (Since ggplot2 3.4.0 the ..density.. notation is deprecated; the current spelling is after_stat(density).)

And similarly to the histogram, we can overlay two or more density plots as well.
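A sketch of the histogram and density overlay on the iris data, written with after_stat(density), which is the newer equivalent of ..density... Bin width and colors are illustrative.

```r
library(ggplot2)

# Histogram rescaled to density so it shares a y axis with the density curve.
p_dens <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 0.25, fill = "grey80", color = "white") +
  geom_density(fill = "steelblue", alpha = 0.4)

# One density curve per species, overlaid.
p_dens_species <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density(alpha = 0.5)

p_dens
p_dens_species
```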

3.5 Ridgeline / Joyplot

This one is quite spectacular looking and informative. It serves a similar function to the overlaid histograms but presents the data much more clearly. For this, we need the ggridges package, which is a ggplot2 extension.
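A minimal ridgeline sketch on the iris data, assuming the ggridges package is installed. The grouping variable goes on the y axis and geom_density_ridges() draws one density per group.

```r
library(ggplot2)
library(ggridges)

p_ridge <- ggplot(iris, aes(x = Sepal.Length, y = Species, fill = Species)) +
  geom_density_ridges(alpha = 0.7) +
  labs(x = "Sepal length (cm)", y = NULL) +
  theme_minimal() +
  theme(legend.position = "none")  # the y axis already names the species

p_ridge
```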

3.6 Bar charts

We can use bar charts to visualise categorical data. Let’s prep some data. (For a refresher, check the first session on factors!) To vary our approaches for educational purposes, this recoding is done in base R, but we could have done it in dplyr as well.

Let’s see the political interest of the Hungarian people.

We can use the fill option to map another variable onto our plot. Let’s see how these categories are further divided by the gender of the respondents. By default we get a stacked bar chart.

We can use the position argument of geom_bar() to change this. Another neat trick to make our graph more readable is coord_flip().

Let’s make sure that the bars are proportional. For this we can use the y = ..prop.. and group = 1 arguments, so the y axis will be calculated as proportions. ..prop.. is a variable computed by the stat; the surrounding .. prevent a collision with a data column named prop. (Newer ggplot2 versions write this as after_stat(prop).)
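A sketch of the bar chart variants. The survey data is not reproduced here, so the mpg dataset that ships with ggplot2 is used as a stand-in: class plays the role of the categorical answer and drv the role of the second grouping variable.

```r
library(ggplot2)

# Counts of one categorical variable (mpg is a stand-in for the survey data).
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# Mapping fill to a second variable gives a stacked bar chart by default.
p_stack <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar()

# position = "dodge" puts the bars side by side; coord_flip() makes them horizontal.
p_dodge <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "dodge") +
  coord_flip()

# Proportions instead of counts: group = 1 makes the proportions sum to 1 overall.
p_prop <- ggplot(mpg, aes(x = class, y = after_stat(prop), group = 1)) +
  geom_bar()

p_dodge
p_prop
```

Without group = 1, the proportion is computed within each bar, so every bar would have height 1.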

Combining categorical and continuous data with a group-by summary is also doable. We just group the data, compute the variables we need, then plot the result.
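A sketch of the summarise-then-plot pattern, using iris as a stand-in dataset and dplyr for the grouping. geom_col() is used because the bar heights are already computed.

```r
library(ggplot2)
library(dplyr)

# Compute one value per category first...
iris_means <- iris %>%
  group_by(Species) %>%
  summarise(mean_sepal_length = mean(Sepal.Length), .groups = "drop")

# ...then plot the precomputed values with geom_col().
p_means <- ggplot(iris_means, aes(x = Species, y = mean_sepal_length)) +
  geom_col() +
  labs(y = "Mean sepal length (cm)")

p_means
```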

3.6.1 Lollipop charts

The lollipop chart is a better bar chart in the sense that it conveys the same information with a better data-ink ratio. It also looks better. (Note: some still consider it a gimmick.)

For this we will modify a chart from the Data Visualisation textbook.

This chart is built in a more complex way, as we have to draw the lines and the dots separately. We draw the lines with geom_segment(), which requires a starting and an ending value for both the x and y axes. The dots are drawn with geom_point(), and the colors come from a dummy variable in the dataset.
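A sketch of the segment-plus-point construction. This is not the textbook chart: mtcars is used as a stand-in, with fuel economy as the value and the transmission type (am) as the dummy variable for color. The helper data frame car_mpg is introduced only for this illustration.

```r
library(ggplot2)

# Stand-in data: one row per car, with a TRUE/FALSE dummy for manual transmission.
car_mpg <- data.frame(
  car    = rownames(mtcars),
  mpg    = mtcars$mpg,
  manual = mtcars$am == 1
)
car_mpg$car <- reorder(car_mpg$car, car_mpg$mpg)  # order the cars by value

p_lolli <- ggplot(car_mpg, aes(y = car)) +
  # the stick: from x = 0 to the value, at the same y position
  geom_segment(aes(x = 0, xend = mpg, yend = car), color = "grey60") +
  # the head: a point at the value, colored by the dummy variable
  geom_point(aes(x = mpg, color = manual), size = 3) +
  labs(x = "Miles per gallon", y = NULL, color = "Manual") +
  theme_minimal()

p_lolli
```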

4 Themes and plot elements

4.1 Themes

In this section we will go over some of the elements that you can modify in order to get an informative and nice looking figure. ggplot2 comes with a number of themes. You can play around with the themes that come with ggplot2 and you can also take a look at the ggthemes package, where I included the economist theme. Another notable theme collection is the hrbrthemes package. The BBC also published their R package which they use to create their graphics. You can find it on GitHub here: https://github.com/bbc/bbplot

Try out a couple to see what they differ in! The ggthemes package has a nice collection of themes to use. The theme presets can be used with the theme_*() function.

One of my personal favourites is theme_minimal().

4.2 Plot elements

Of course we can set all elements to suit our need, without using someone else’s theme.

The key plot elements that we will look at are:

  • labels
  • gridlines
  • fonts
  • colors
  • legend
  • axis breaks

Let’s add labels and a title, as we did before.

Let’s use a different color scale! We can use a color brewer scale (widely used for data visualization). To check the various palettes, see http://colorbrewer2.org

Or we can define our own colors:
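A sketch of both options on the iris data. The palette name "Set1" and the three HEX codes are illustrative choices, not the ones used in the workshop.

```r
library(ggplot2)

base_plot <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width,
                              color = Species)) +
  geom_point()

# A ColorBrewer palette, selected by name.
p_brewer <- base_plot + scale_color_brewer(palette = "Set1")

# Our own colors, one per category, given as HEX codes.
p_manual <- base_plot +
  scale_color_manual(values = c("#E69F00", "#56B4E9", "#009E73"))

p_brewer
p_manual
```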

To clean up clutter, we will remove the background and only leave some of the grid behind. We can hide the tick marks by setting axis.ticks to element_blank() inside the theme() function. Hiding gridlines also requires some digging in theme(), using the panel.grid.minor and panel.grid.major arguments. If you want to remove the gridlines of one axis only, you can specify, for example, panel.grid.major.x. We can also set the background to nothing. Furthermore, we can define the text attributes of our labels as well.

Finally, let’s move the legend around, or just remove it with theme(legend.position="none"). We also do not need the background of the legend keys, so we remove it with legend.key, and we can adjust the text elements of the plot with text.
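A sketch of these theme() settings collected in one place, applied to an iris scatter plot. The specific choices (legend at the bottom, grey horizontal gridlines, text size 12) are illustrative.

```r
library(ggplot2)

my_theme <- theme(
  axis.ticks         = element_blank(),                # hide tick marks
  panel.grid.minor   = element_blank(),                # no minor gridlines
  panel.grid.major.x = element_blank(),                # no vertical gridlines
  panel.grid.major.y = element_line(color = "grey85"), # keep light horizontal ones
  panel.background   = element_blank(),                # no panel background
  legend.position    = "bottom",                       # or "none" to remove it
  legend.key         = element_blank(),                # no background behind legend keys
  text               = element_text(size = 12),        # base text settings
  plot.title         = element_text(face = "bold")
)

p_theme <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width,
                            color = Species)) +
  geom_point() +
  labs(title = "Sepal dimensions of iris flowers") +
  my_theme

p_theme
```

Storing the theme() call in an object makes it easy to reuse the same look across several plots.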

While we are at it, we want to have labels for our data. For this, we’ll create a plot that can make use of them.

We use geom_text() to put our labels in the chart.

To avoid overlapping text, use the ggrepel package which provides this functionality via the ggrepel::geom_text_repel and the ggrepel::geom_label_repel functions.

Without

Notice the different outcome when using geom_label instead of geom_text.

If we want to label a specific set of countries we can do it from inside ggplot, without needing to touch our data.
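A sketch of the three labelling approaches, assuming the ggrepel package is installed. The country data is not reproduced here, so car names from mtcars stand in for the country labels, and the mpg > 30 filter is an arbitrary rule for picking a subset.

```r
library(ggplot2)
library(ggrepel)

# Stand-in data: car names play the role of country names.
car_df <- data.frame(car = rownames(mtcars), wt = mtcars$wt, mpg = mtcars$mpg)

# Plain geom_text(): labels sit on the points and overlap each other.
p_text <- ggplot(car_df, aes(x = wt, y = mpg, label = car)) +
  geom_point() +
  geom_text(size = 3)

# geom_text_repel(): labels are pushed away from each other and from the points.
p_repel <- ggplot(car_df, aes(x = wt, y = mpg, label = car)) +
  geom_point() +
  geom_text_repel(size = 3)

# Label only a subset by giving the label layer its own data argument;
# the original data frame stays untouched.
p_subset <- ggplot(car_df, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_label_repel(data = subset(car_df, mpg > 30), aes(label = car))

p_subset
```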

5 Special cases

5.1 Network visualization

Let’s load our data from an edgelist. We are using the tidygraph and ggraph packages, but both depend heavily on the igraph package, which is one of the most powerful packages for network analysis in R.

Let’s create the network object and add some network statistics to our small social network.

We plot the network with the ggraph() function, which is a network-oriented extension of ggplot2. The nodes and links are plotted separately with the geom_edge_* and geom_node_* functions, in this case link and point.

Here is an alternative with modifications to the link and node attributes. Note the theme_graph() at the end.

As a final touch, let’s add the communities in the network and labels for our nodes.
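A sketch of the whole network workflow, assuming tidygraph and ggraph are installed. The edge list below is a made-up toy network with hypothetical names, not the workshop data, and degree and Louvain communities are just two example statistics.

```r
library(ggplot2)
library(tidygraph)
library(ggraph)

# Hypothetical toy edge list: one row per tie between two people.
edges <- data.frame(
  from = c("Anna", "Anna", "Bela", "Bela", "Cili", "Dani", "Dani", "Emma", "Feri"),
  to   = c("Bela", "Cili", "Cili", "Dani", "Dani", "Emma", "Feri", "Feri", "Gabi")
)

# Build an undirected network and add node-level statistics.
net <- as_tbl_graph(edges, directed = FALSE) %>%
  mutate(
    degree    = centrality_degree(),          # number of ties per node
    community = as.factor(group_louvain())    # community membership
  )

# Edges and nodes are drawn by separate geoms.
p_net <- ggraph(net, layout = "fr") +
  geom_edge_link(alpha = 0.5) +
  geom_node_point(aes(size = degree, color = community)) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_graph(base_family = "sans")  # "sans" avoids depending on the default font

p_net
```

The "fr" layout is randomly initialised, so node positions change between runs unless a seed is set beforehand.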

5.2 Maps

Two essential parts of creating a map with ggplot2:

  • a shapefile, which draws the map
  • some data that we want to plot over the map

Getting the map data from the maps package

We can plot the empty map

We can also subset the map data, just as we can with any other R object

We can add data to our map. We subset our gapminder data for the year 1977. Then we add a new column that matches the region variable in the map data so we can merge the two datasets. (We also get rid of Antarctica, for aesthetic reasons.)

And now we can plot the map and data with the geom_polygon() and coord_quickmap(). I also made some modifications to the theme, so it looks better.
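A sketch of the map workflow. It assumes the maps package (needed by map_data()) and uses the gapminder package as a stand-in for the workshop data; lifeExp is used as the fill variable only as an example.

```r
library(ggplot2)
library(dplyr)
library(gapminder)  # assumption: stand-in for the workshop's own data file

# Map outlines as a data frame of polygon corners; drop Antarctica.
world <- map_data("world") %>%
  filter(region != "Antarctica")

# 1977 data with a column whose name matches the map data's `region`.
gap_1977 <- gapminder %>%
  filter(year == 1977) %>%
  mutate(region = as.character(country))

# left_join keeps every map row (and its order); unmatched regions get NA.
world_gap <- left_join(world, gap_1977, by = "region")

p_map <- ggplot(world_gap, aes(x = long, y = lat, group = group,
                               fill = lifeExp)) +
  geom_polygon(color = "white") +
  coord_quickmap() +
  labs(fill = "Life expectancy, 1977") +
  theme_minimal() +
  theme(axis.title = element_blank(),
        axis.text  = element_blank(),
        panel.grid = element_blank())

p_map
```

Countries whose names differ between the two sources (for example "USA" in the map data versus "United States" in gapminder) stay unmatched in this sketch and would need recoding before the join.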

5.2.1 Example from the eurostat package

The vignette contains the full tutorial on how to use the eurostat package to get data through the eurostat API. If you are interested check it out later.

For this example we are going to use the Eurobarometer data to plot trust in the EU. We select the country iso codes and the relevant item (q8a_10). Then we create a proper factor with the mutate function, as well as recode some obscure country codes that would cause problems with merging further down the line. For this we do this ugly looking nested dplyr::if_else chain. Finally we drop NAs.

Then we calculate the proportion of people trusting the EU in each member state.

Finally, we download the EU shapefile via the eurostat package.

To plot the data on the map we have to combine the shapefile and the survey data with a left_join.

Now we are ready to plot! We use geom_sf() to fill the countries with our data and coord_sf() to have the map projected between the given coordinates. For the color scale, let’s use the viridis scale. Although we use theme_minimal(), we can make additional changes to the theme by adding the theme() call last.

If we want to add arbitrary points to the map, we can do that by specifying their longitude and latitude coordinates.

Then we just plot them over our map with geom_point().
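A sketch of the geom_sf() pattern that runs without downloading anything. It assumes the sf package and uses the North Carolina shapefile that ships with sf as a stand-in for the Eurostat map; BIR74 (births in 1974) stands in for the survey proportion, and the point coordinates for Raleigh are approximate.

```r
library(ggplot2)
library(sf)

# Stand-in shapefile bundled with sf (not the Eurostat map).
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# An arbitrary point given by longitude and latitude (approximate values).
pts <- data.frame(name = "Raleigh", lon = -78.64, lat = 35.78)

p_sf <- ggplot(nc) +
  geom_sf(aes(fill = BIR74), color = "white") +      # polygons filled by a variable
  geom_point(data = pts, aes(x = lon, y = lat),      # point plotted over the map
             color = "red", size = 3) +
  coord_sf(xlim = c(-84.5, -75), ylim = c(33.5, 37)) +  # crop to a bounding box
  scale_fill_viridis_c() +
  theme_minimal() +
  theme(axis.title = element_blank())                # extra tweak, added last

p_sf
```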